-
Notifications
You must be signed in to change notification settings - Fork 29k
Revert "[SPARK-21052][SQL] Add hash map metrics to join" #23204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Cluster info:
Related parameters setting:
In above test environment, we found a serious performance degradation issue in Spark2.3 when running TPC-DS on SKX 8180. We investigated this problem and figured out the root cause is in community patch SPARK-21052 which add metrics to hash join process. And the impact code is L486 and L487 . Following is the result of TPC-DS Q19 in spark2.1, spark2.3 remove L486&487, spark2.3 add L486&487 and spark2.4.
|
|
ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's easy to just run the git revert command, but I'd like to manually revert it, since that PR was merged long time ago. And we should still keep changes like this renaming, as they are not quite related to the performance regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, it's better to put this code block here, let's keep this change.
|
I'm fine to revert it if it caused a significant performance regression, we should revisit it later, with different ideas, like updating the metrics for each batch instead of each record. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can keep the changes in this file as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If as your test shows this is the cause of performance regression, we can just revert this and related changes. The change in HashAggregateExec, etc. can be kept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Is this observable in general hash join query, except for TPC-DS Q19? |
|
Test build #99615 has finished for PR 23204 at commit
|
|
@cloud-fan @viirya #23214 maybe reslove this problem and we needn't revert this patch. |
|
Hi, @LuciferYang . If we are not going to revert this, could you close this PR? |
|
according to #23214 (comment) , the hash join metrics is wrongly implemented. I think it's fine to revert it and re-implement it later. @JkSelf can you address the comments and only revert the hash join part? thanks! |
|
The result of all queries in tpcds with 1TB data scale is in tpcds result |
|
@cloud-fan ok, i will revert as your comments later. |
|
@cloud-fan and @JkSelf . |
|
+1 |
|
@cloud-fan If we decide to partial revert SPARK-21052 and no need for #23214, I will close it. |
|
If we can quickly finish #23214 (within several days), let's go for it. But if we can't, I'd suggest we do the partial revert first to fix the perf regression, and add back the metrics later. |
|
@cloud-fan @dongjoon-hyun update the patch, please help review if you have time. Thanks. |
|
ok~ already close #23214 |
|
can we follow #23204 (comment) and create a new ticket? |
|
Test build #99894 has finished for PR 23204 at commit
|
|
@cloud-fan the new ticket is in here. I will close this ticket. |
| } | ||
|
|
||
| // At the end of the task, we update the avg hash probe. | ||
| TaskContext.get().addTaskCompletionListener[Unit](_ => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this file, the join method takes avgHashProbe: SQLMetric, we should remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan updated in new PR #23269 Thanks.
|
I'll close this in order to collect the reviews into new PR, #23269 . |
Partial revert SPARK-26155, because of the performance degradation in TPC-DS result with 1TB data scale.